NOTE: This is a pre-publication release.

Version 0.1

This repository includes details for replicating the results in:

Wenxin Jiang, Gary King, Allen Schmaltz, and Martin A. Tanner. 2018. "Ecological Regression with Partial Identification". (under review)


## Figures

To reproduce the graphs in the paper, from this directory run

bash generate_figures.sh

This will (re-)create figures 1-3, placing them in the folder output/figures.

## Simulations (Section 6)

The code for replicating Examples 3, 4, 5 in Section 6 is in the script
analysis/example_simulations.R.

This can be run from this directory as follows, (re-)saving the expected output to
output/expected_output/simulation_output.txt:

Rscript analysis/example_simulations.R >output/expected_output/simulation_output.txt

## Real data analyses (Section 7)

The data is a combination of existing data sources available from
R libraries and new data sources collected from the CDC, the census of India,
and the US census. Since the new data sources each have their own licenses,
we provide instructions here for downloading from the original sources and then
pre-processing the data to replicate the results in the paper.


1. To download the CDC data by gender, take the following steps:

Go to https://wonder.cdc.gov/UCD-ICD10.html. If you agree to the terms of service,
click "I agree".

2. In the "Underlying Cause of Death, 1999-2016 Request" form, choose the following options (leaving everything else at the defaults):

Group Results By: County
And By: Gender

Ten-Year Age Groups:
< 1 year
1-4 years
5-14 years

Race:
Black or African American
White

Year/Month:
[select 1999-2015]

Export Results: [select]
Show Zero Values: [select]
Show Suppressed Values: [select]

Save the file to data/datasets as 'UnderlyingCauseofDeath1999-2015_bygender_black_white_agelessthan14.formatted.txt'.

3. To download the CDC data by race, repeat the above steps, replacing "And By: Gender" with "And By: Race".

Save the file to data/datasets as 'UnderlyingCauseofDeath1999-2015_byrace_black_white_agelessthan14.formatted.txt'.

4. From this directory run

bash format_cdc_data.sh

This will format the CDC data (as a 2x2 EI table) for subsequent steps.

5. To download the India census data, take the following steps:

Go to https://data.gov.in/catalog/population-attending-educational-institution-completed-education-level-age-and-sex-census and download the .xls file (District-wise Population Attending Educational Institution by Completed Education Level, Age and Sex, 2001 - [state/territory name]) for each available state/territory of India. There should be 35 files in total. (We downloaded the data on November 2, 2017.) Save all of the .xls files to the data/datasets/india_census directory.
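
Before running the formatting script, it may help to confirm that all 35 files landed in the right place. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and it assumes the downloads keep a lowercase `.xls` extension.

```python
import glob
import os

# Number of state/territory files stated in the instructions above.
EXPECTED_FILE_COUNT = 35


def count_xls_files(directory):
    """Return the number of .xls files directly inside `directory`."""
    return len(glob.glob(os.path.join(directory, "*.xls")))


def check_india_census_dir(directory="data/datasets/india_census"):
    """Return True if the directory holds the expected number of .xls files."""
    return count_xls_files(directory) == EXPECTED_FILE_COUNT


if __name__ == "__main__":
    found = count_xls_files("data/datasets/india_census")
    print("Found %d of %d expected .xls files" % (found, EXPECTED_FILE_COUNT))
```

Run from this directory, the script only reports the count; it does not inspect the contents of the files.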

6. From this directory run

bash format_india_data.sh

7. To download the IPUMS data:

Create an account at https://usa.ipums.org/usa/ and log in.

8. Since the combined files are relatively large, we separately download the full census files, the ACS sample files, and the sampled census files.

For the full census data (after logging in to https://usa.ipums.org/usa/ and clicking "Get Data"):

Select Samples: USA Full Count: 1850,1880,1900,1910,1920,1930,1940

Select the following variables:

YEAR
STATEICP
COUNTY
MCD
HHWT
PERWT
DATANUM
SERIAL
PERNUM
SEX
LIT
RACE
LABFORCE
NATIVITY
SPEAKENG
CLASSWKR
VETSTAT
EMPSTAT

Create the extract (retaining all other values at their default settings). (Note that with the 100% census data, you will need to agree to the special usage terms in order to download.) The full .csv file may be over 60 GB. To reduce the file size, you can download years or variable groups separately, making use of the --run_set argument in the following script (ei_preprocessing_ipums_full_census.py).
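
Since the extract can be very large, a quick check that the selected variables are present can save a failed preprocessing run. The sketch below reads only the first row of the file. It assumes that the downloaded .csv begins with a header row of variable names; the function name `missing_columns` is hypothetical and is not part of the repository.

```python
import csv

# Variables listed above for the full census extract.
FULL_CENSUS_VARIABLES = (
    "YEAR", "STATEICP", "COUNTY", "MCD", "HHWT", "PERWT", "DATANUM",
    "SERIAL", "PERNUM", "SEX", "LIT", "RACE", "LABFORCE", "NATIVITY",
    "SPEAKENG", "CLASSWKR", "VETSTAT", "EMPSTAT",
)


def missing_columns(csv_path, required=FULL_CENSUS_VARIABLES):
    """Return the required variable names that are absent from the header row.

    Only the first line is read, so this stays fast on very large files.
    Assumption: the first row of the extract is a header of variable names.
    """
    with open(csv_path) as handle:
        header = next(csv.reader(handle), [])
    present = set(name.strip().upper() for name in header)
    return [name for name in required if name not in present]
```

An empty list means every listed variable was found. Extra columns in the extract are ignored by this check.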

Save the resulting .csv file and note the location (here, we use the placeholder 'usa_0000xfullcensus.csv'). Add the following line to the end of the file:

"9999,,,,,,,,,,,,,,,,,,,,,,,,,"

as, for example, by running:

echo "9999,,,,,,,,,,,,,,,,,,,,,,,,," >> usa_0000xfullcensus.csv
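
Running the `echo` command twice would append the line twice. As an alternative, the following sketch appends the line only when the file does not already end with it. The helper `append_sentinel` is hypothetical and not part of the repository. Unlike the `echo` command, it also inserts a newline first if the file does not end with one. It reads only the last few kilobytes, so it remains practical for very large extracts.

```python
# The line to append, copied from the instructions above.
SENTINEL_LINE = "9999,,,,,,,,,,,,,,,,,,,,,,,,,"


def append_sentinel(csv_path, sentinel=SENTINEL_LINE):
    """Append `sentinel` as the final line unless it is already the final line.

    Returns True if the line was appended and False if it was already there.
    """
    # Read only the tail of the file to find the current last line.
    with open(csv_path, "rb") as handle:
        handle.seek(0, 2)  # jump to the end of the file
        size = handle.tell()
        handle.seek(max(0, size - 4096))
        tail = handle.read().decode("ascii", "replace")

    lines = tail.splitlines()
    if lines and lines[-1].strip() == sentinel:
        return False

    with open(csv_path, "a") as handle:
        # Make sure the sentinel starts on its own line.
        if tail and not tail.endswith("\n"):
            handle.write("\n")
        handle.write(sentinel + "\n")
    return True
```

For example, `append_sentinel("usa_0000xfullcensus.csv")` could be used in place of the `echo` command; the same call applies to the other extract files described in later steps.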


Run the following script, replacing paths, as applicable:

python data/preprocessing/ipums/ei_preprocessing_ipums_full_census.py \
--run_set "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17" \
--input_csv_data_file usa_0000xfullcensus.csv \
--output_dir [REPLACE WITH FULL PATH]/data/datasets

This will save the individual 2x2 datasets to the data/datasets folder.

9. For the sampled census data (after logging in to https://usa.ipums.org/usa/ and clicking "Get Data"):

Select Samples: USA SAMPLES:
1850 1%
1860 1%
1870 1%
1880 10%
1900 5%
1910 1%
1920 1%
1930 5%
1940 1%
1950 1%
1960 5%
1980 5%
1990 5%
2000 5%
2010 10%

Select the following variables:
YEAR
DATANUM
SERIAL
HHWT
STATEICP
COUNTY
MCD
GQ
STDMCD
PERNUM
PERWT
SEX
RACE
NATIVITY
CITIZEN
LANGUAGE
SPEAKENG
LIT
EMPSTAT
LABFORCE
CLASSWKR
POVERTY
VETSTAT


Create the extract (retaining all other values at their default settings).

Save the resulting .csv file and note the location (here, we use the placeholder 'usa_0000xsamplecensus.csv'). Add the following line to the end of the file:

"9999,,,,,,,,,,,,,,,,,,,,,,,,,"

as, for example, by running:

echo "9999,,,,,,,,,,,,,,,,,,,,,,,,," >> usa_0000xsamplecensus.csv

10. For the sampled ACS data (after logging in to https://usa.ipums.org/usa/ and clicking "Get Data"):

Select Samples: USA SAMPLES:
2005 ACS
2006 ACS
2007 ACS
2008 ACS
2009 ACS
2010 ACS
2011 ACS
2012 ACS
2013 ACS
2014 ACS
2015 ACS
2016 ACS

Select the following variables:
YEAR
DATANUM
SERIAL
HHWT
STATEICP
COUNTY
GQ
PERNUM
PERWT
SEX
RACE
CITIZEN
LANGUAGE
SPEAKENG
HCOVANY
EMPSTAT
LABFORCE
CLASSWKR
POVERTY
VETSTAT


Create the extract (retaining all other values at their default settings).

Save the resulting .csv file and note the location (here, we use the placeholder 'usa_0000xsampleacs.csv'). Add the following line to the end of the file:

"9999,,,,,,,,,,,,,,,,,,,,,,,,,"

as, for example, by running:

echo "9999,,,,,,,,,,,,,,,,,,,,,,,,," >> usa_0000xsampleacs.csv

11. Run the following script, replacing paths, as applicable:

python data/preprocessing/ipums/ei_preprocessing_ipums_census_acs_samples.py \
--run_set "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23" \
--input_csv_census_sample_data_file usa_0000xsamplecensus.csv \
--input_csv_acs_sample_data_file usa_0000xsampleacs.csv \
--output_dir [REPLACE WITH FULL PATH]/data/datasets

This will save the individual 2x2 datasets to the data/datasets folder.

12. To generate the bounds and summary statistics, from this directory run

bash generate_bounds_for_all_datasets.sh

13. Finally, to generate Tables 1 and 2, run the following from this directory:

bash generate_tables.sh >output/expected_output/table1_and_table2.txt

This will save the data for Table 1 and Table 2 to the file output/expected_output/table1_and_table2.txt.
Note that this file already exists and will be overwritten. You can verify the preprocessing of the data by comparing to the file that is provided.
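
Because the command above overwrites the provided file, one way to make the comparison is to copy the provided file aside before running it and then diff the two. The sketch below is one possible approach, not part of the repository; the function name and the backup file name in the usage note are hypothetical.

```python
import difflib


def diff_text_files(provided_path, regenerated_path):
    """Return unified-diff lines between two text files.

    An empty list means the two files have identical lines.
    """
    with open(provided_path) as handle:
        provided = handle.read().splitlines()
    with open(regenerated_path) as handle:
        regenerated = handle.read().splitlines()
    return list(
        difflib.unified_diff(
            provided,
            regenerated,
            fromfile=provided_path,
            tofile=regenerated_path,
            lineterm="",
        )
    )
```

For example, after copying the provided file to `output/expected_output/table1_and_table2.provided.txt` and regenerating the tables, `diff_text_files` can be called on the copy and on `output/expected_output/table1_and_table2.txt`. Any returned lines show where the regenerated output departs from the provided one.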


## Dependencies

R libraries:
sandwich
ei
eco
eiPack

Python 2.7

$ Rscript --version
R scripting front-end version 3.3.3 (2017-03-06)

This code has been tested on macOS Sierra. It should run on Linux machines.
